Deep Over-sampling Framework for Classifying Imbalanced Data
Class imbalance is a challenging issue in practical classification problems
for deep learning models as well as traditional models. Traditionally
successful countermeasures such as synthetic over-sampling have had limited
success with complex, structured data handled by deep learning models. In this
paper, we propose Deep Over-sampling (DOS), a framework for extending the
synthetic over-sampling method to exploit the deep feature space acquired by a
convolutional neural network (CNN). Its key feature is an explicit, supervised
representation learning, for which the training data presents each raw input
sample with a synthetic embedding target in the deep feature space, which is
sampled from the linear subspace of in-class neighbors. We implement an
iterative process of training the CNN and updating the targets, which induces
smaller in-class variance among the embeddings, to increase the discriminative
power of the deep representation. We present an empirical study using public
benchmarks, which shows that the DOS framework not only counteracts class
imbalance better than the existing method, but also improves the performance of
the CNN in the standard, balanced settings.
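The abstract describes each embedding target as sampled from the subspace spanned by in-class neighbors. A minimal sketch of that sampling step follows, assuming a random convex combination of the k nearest same-class embeddings; the function names, the value of k, and the toy embeddings are hypothetical, and the iterative CNN training itself is not shown.

```python
import math
import random


def in_class_neighbors(embeddings, labels, i, k):
    """Indices of the k nearest same-class embeddings to sample i (excluding i)."""
    candidates = [
        j for j in range(len(embeddings)) if j != i and labels[j] == labels[i]
    ]
    candidates.sort(key=lambda j: math.dist(embeddings[i], embeddings[j]))
    return candidates[:k]


def synthetic_target(embeddings, labels, i, k, rng):
    """Random convex combination of the in-class neighbors of sample i.

    Assumption: non-negative weights that sum to 1. The abstract only says
    the target is sampled from the linear subspace of in-class neighbors.
    """
    neighbors = in_class_neighbors(embeddings, labels, i, k)
    if not neighbors:
        # Lone class member: fall back to its own embedding.
        return list(embeddings[i])
    # 1 - random() lies in (0, 1], so the weights are strictly positive.
    weights = [1.0 - rng.random() for _ in neighbors]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(embeddings[i])
    return [
        sum(w * embeddings[j][d] for w, j in zip(weights, neighbors))
        for d in range(dim)
    ]


# Toy 2-D "deep features": three samples of class 0, two of class 1.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = [0, 0, 0, 1, 1]
target = synthetic_target(embeddings, labels, i=0, k=2, rng=random.Random(0))
```

In this toy case the target for sample 0 lies on the segment between its two class-0 neighbors, so it never mixes in features from class 1.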
WTEN: An advanced coupled tensor factorization strategy for learning from imbalanced data
© Springer International Publishing AG 2016. Learning from imbalanced and sparse data in multi-mode and high-dimensional tensor formats efficiently is a significant problem in data mining research. On one hand, Coupled Tensor Factorization (CTF) has become one of the most popular methods for joint analysis of heterogeneous sparse data generated from different sources. On the other hand, techniques such as sampling, cost-sensitive learning, etc. have been applied to many supervised learning models to handle imbalanced data. This research focuses on studying the effectiveness of combining advantages of both CTF and imbalanced data learning techniques for missing entry prediction, especially for entries with rare class labels. Importantly, we have also investigated the implication of joint analysis of the main tensor and extra information. One of our major goals is to design a robust weighting strategy for CTF to be able to not only effectively recover missing entries but also perform well when the entries are associated with imbalanced labels. Experiments on both real and synthetic datasets show that our approach outperforms existing CTF algorithms on imbalanced data.
MaaSim: A Liveability Simulation for Improving the Quality of Life in Cities
Urbanism is no longer planned on paper thanks to powerful models and 3D
simulation platforms. However, current work is not open to the public and lacks
an optimisation agent that could help in decision making. This paper describes
the creation of an open-source simulation based on an existing Dutch
liveability score with a built-in AI module. Features are selected using
feature engineering and Random Forests. Then, a modified scoring function is
built based on the former liveability classes. The score is predicted using
Random Forest for regression and achieved a recall of 0.83 with 10-fold
cross-validation. Afterwards, Exploratory Factor Analysis is applied to select
the actions present in the model. The resulting indicators are divided into 5
groups, and 12 actions are generated. The performance of four optimisation
algorithms is compared, namely NSGA-II, PAES, SPEA2 and eps-MOEA, on three
established criteria of quality: cardinality, the spread of the solutions,
spacing, and the resulting score and number of turns. Although all four
algorithms show different strengths, eps-MOEA is selected to be the most
suitable for this problem. Ultimately, the simulation incorporates the model
and the selected AI module in a GUI written in the Kivy framework for Python.
Tests performed on users show positive responses and encourage further
initiatives towards joining technology and public applications.
A matter of words: NLP for quality evaluation of Wikipedia medical articles
Automatic quality evaluation of Web information is a task with many fields of
applications and of great relevance, especially in critical domains like the
medical one. We start from the intuition that the content quality of medical
Web documents is affected by features related to the specific domain: first,
the usage of a specific vocabulary (Domain Informativeness); then, the adoption
of specific codes (like those used in the infoboxes of Wikipedia articles) and
the type of document (e.g., historical and technical ones). In this paper, we
propose to leverage specific domain features to improve the results of the
evaluation of Wikipedia medical articles. In particular, we evaluate the
articles adopting an "actionable" model, whose features are related to the
content of the articles, so that the model can also directly suggest strategies
for improving the quality of a given article. We rely on Natural Language Processing
(NLP) and dictionary-based techniques to extract the bio-medical
concepts in a text. We demonstrate the effectiveness of our approach by classifying
the medical articles of the Wikipedia Medicine Portal, which have been
previously manually labeled by the Wiki Project team. The results of our
experiments confirm that, by considering domain-oriented features, it is
possible to obtain appreciable improvements with respect to existing solutions,
mainly for those articles that other approaches classify less accurately.
Besides being interesting in their own right, the results call for further
research in the area of domain-specific features suitable for Web data quality
assessment.
Improving Attitude Words Classification for Opinion Mining using Word Embedding
Recognizing and classifying evaluative expressions is an important issue in sentiment analysis. This paper presents a corpus-based method for classifying attitude types (Affect, Judgment and Appreciation) and attitude orientation (positive and negative) of words in Spanish, relying on the Attitude system of the Appraisal Theory. The main contribution lies in exploring large, unlabeled corpora using neural network word embedding techniques in order to obtain semantic information among words of the same attitude and orientation class. Experimental results show that the proposed method achieves good effectiveness and outperforms the state of the art for automatic classification of attitude words in the Spanish language.
The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P research project (MINECO/FEDER).
Ortega-Bueno, R.; Medina-Pagola, JE.; Muñiz-Cuza, CE.; Rosso, P. (2019). Improving Attitude Words Classification for Opinion Mining using Word Embedding. Lecture Notes in Computer Science. 11401:971-982. https://doi.org/10.1007/978-3-030-13469-3_112
A swarm intelligence approach in undersampling majority class
Over the years, machine learning has faced the issue of imbalanced datasets, which occur when the instances in one class significantly outnumber those in the other class. This study investigates a new approach for balancing the dataset using a swarm intelligence technique, Stochastic Diffusion Search (SDS), to undersample the majority class on a direct marketing dataset. This novel application of the swarm intelligence algorithm yields promising results, which suggest that a majority class can be undersampled by removing redundant data whilst protecting the useful data in the dataset. This paper details the behaviour of the proposed algorithm in dealing with this problem and contrasts its results with those of other techniques.
A critical look at studies applying over-sampling on the TPEHGDB dataset
Preterm birth is the leading cause of death among young children and has a large prevalence globally. Machine learning models, based on features extracted from clinical sources such as electronic patient files, yield promising results. In this study, we review similar studies that constructed predictive models based on a publicly available dataset, called the Term-Preterm EHG Database (TPEHGDB), which contains electrohysterogram signals on top of clinical data. These studies often report near-perfect prediction results, by applying over-sampling as a means of data augmentation. We reconstruct these results to show that they can only be achieved when data augmentation is applied on the entire dataset prior to partitioning into training and testing sets. This results in (i) samples that are highly correlated with data points from the test set being added to the training set, and (ii) artificial samples that are highly correlated with points from the training set being added to the test set. Many previously reported results therefore carry little meaning in terms of the actual effectiveness of the model in making predictions on unseen data in a real-world setting. After focusing on the danger of applying over-sampling strategies before data partitioning, we present a realistic baseline for the TPEHGDB dataset and show how the predictive performance and clinical use can be improved by incorporating features from electrohysterogram sensors and by applying over-sampling on the training set.
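The ordering issue described above, partition first and over-sample only the training set, can be illustrated with a small sketch. It assumes plain random duplication of minority samples as a stand-in for the synthetic over-sampling used in the reviewed studies; the function names, the toy data, and the 25% test fraction are hypothetical.

```python
import random


def split(samples, test_fraction, rng):
    """Shuffle and partition samples into (train, test)."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def oversample_minority(train, minority_label, rng):
    """Duplicate random minority samples until both classes are equally large.

    Assumption: duplication stands in for synthetic over-sampling.
    """
    minority = [s for s in train if s[1] == minority_label]
    majority = [s for s in train if s[1] != minority_label]
    if not minority:
        return list(train)  # nothing to duplicate
    n_extra = max(0, len(majority) - len(minority))
    return train + [rng.choice(minority) for _ in range(n_extra)]


# Toy dataset of (sample_id, label): 4 minority (label 1), 16 majority (label 0).
samples = [(i, 1 if i < 4 else 0) for i in range(20)]
rng = random.Random(42)

# Partition first, so the test set holds only original, untouched samples.
train, test = split(samples, test_fraction=0.25, rng=rng)

# Over-sample the training partition only.
balanced_train = oversample_minority(train, minority_label=1, rng=rng)
```

Because the duplicates are drawn from the training partition alone, no test sample (or a copy of one) can reach the training set, which is the leakage path the abstract identifies.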
Fall Detection Analysis Using a Real Fall Dataset
International Conference on Soft Computing Models in Industrial and Environmental Applications (13th. 2018. San Sebastián
On the suitability of resampling techniques for the class imbalance problem in credit scoring
In real-life credit scoring applications, the class of defaulters is very commonly under-represented in comparison with the class of non-defaulters, yet this situation has received little attention. The present paper investigates the suitability and performance of several resampling techniques when applied in conjunction with statistical and artificial intelligence prediction models over five real-world credit data sets, which have been artificially modified to derive different imbalance ratios (proportion of defaulter and non-defaulter examples). Experimental results demonstrate that the use of resampling methods consistently improves on the performance obtained with the original imbalanced data. It is also important to note that, in general, over-sampling techniques perform better than any under-sampling approach. This work has partially been supported by the Spanish Ministry of Education and Science under grant TIN2009-14205 and the Generalitat Valenciana under grant PROMETEO/2010/028.
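One way to derive data sets with different imbalance ratios from a single source is sketched below. The abstract does not say how the authors modified their data, so the approach here (keeping every non-defaulter and randomly subsampling defaulters) is an assumption, as are the function name and the toy counts.

```python
import random


def with_imbalance_ratio(samples, minority_label, ratio, rng):
    """Keep all majority samples and subsample the minority so that
    majority : minority is approximately ratio : 1."""
    minority = [s for s in samples if s[1] == minority_label]
    majority = [s for s in samples if s[1] != minority_label]
    # Never ask for more minority samples than exist, and keep at least one.
    n_keep = min(len(minority), max(1, round(len(majority) / ratio)))
    return majority + rng.sample(minority, n_keep)


# Toy credit data of (sample_id, label): 100 non-defaulters (0), 50 defaulters (1).
samples = [(i, 0) for i in range(100)] + [(100 + i, 1) for i in range(50)]
rng = random.Random(7)

ratio_4 = with_imbalance_ratio(samples, minority_label=1, ratio=4, rng=rng)
ratio_10 = with_imbalance_ratio(samples, minority_label=1, ratio=10, rng=rng)
```

Each derived set can then be passed through the resampling technique under study, so that the same classifier is compared across increasingly skewed class distributions.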